Unsupervised Learning Project

Context:
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:
● All the features are geometric features extracted from the silhouette.
● All are numeric in nature.

Objective:
Apply dimensionality reduction technique – PCA and train a model using principle components instead of training the model using just the raw data.

1. Data pre-processing – Perform all the necessary preprocessing on the data ready to be fed to an Unsupervised algorithm (10 marks)

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder 
from sklearn.impute import SimpleImputer
from scipy.stats import iqr
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score
from scipy.stats import zscore
from sklearn.decomposition import PCA
In [2]:
#loading data
df=pd.read_csv('vehicle.csv')
In [3]:
df.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [4]:
shape=df.shape  #Provides the Shape in (Rows, Columns) in the Data Frame df
print('shape of the data frame is =',shape)
shape of the data frame is = (846, 19)
In [5]:
#Column names
df.columns
Out[5]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')

Attribute Information

COMPACTNESS (average perim)**2/area

CIRCULARITY (average radius)**2/area

DISTANCE CIRCULARITY area/(av.distance from border)**2

RADIUS RATIO (max.rad-min.rad)/av.radius

PR.AXIS ASPECT RATIO (minor axis)/(major axis)

MAX.LENGTH ASPECT RATIO (length perp. max length)/(max length)

SCATTER RATIO (inertia about minor axis)/(inertia about major axis)

ELONGATEDNESS area/(shrink width)**2

PR.AXIS RECTANGULARITY area/(pr.axis length*pr.axis width)

MAX.LENGTH RECTANGULARITY area/(max.length*length perp. to this)

SCALED VARIANCE (2nd order moment about minor axis)/area ALONG MAJOR AXIS

SCALED VARIANCE (2nd order moment about major axis)/area ALONG MINOR AXIS

SCALED RADIUS OF GYRATION (mavar+mivar)/area

SKEWNESS ABOUT (3rd order moment about major axis)/sigma_min**3 MAJOR AXIS

SKEWNESS ABOUT (3rd order moment about minor axis)/sigma_maj**3 MINOR AXIS

KURTOSIS ABOUT (4th order moment about major axis)/sigma_min**4 MINOR AXIS

KURTOSIS ABOUT (4th order moment about minor axis)/sigma_maj**4 MAJOR AXIS

HOLLOWS RATIO (area of hollows)/(area of bounding polygon)

Where sigma_maj2 is the variance along the major axis and sigma_min2 is the variance along the minor axis, and

area of hollows= area of bounding poly-area of object

The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object orientated at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.

In [6]:
#dataframe information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [7]:
df['class'].value_counts()
Out[7]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [8]:
#Converting the Data type for the Categorical attributes from object to Category data type
df = df.astype({"class":'category'})
In [9]:
i = 0
#Length of the coulmns of the data fram
n=len(df.columns)
#List of all the attribues in the data frame 
List=list(df.columns.values)
print('Data type of each attribute of Data frame:\n')
while i < n:     
    New_List=List[i]
    Data_type=df[New_List].dtype
    print('Data Type of',New_List,'attribute is:',Data_type)
    i=i+1
Data type of each attribute of Data frame:

Data Type of compactness attribute is: int64
Data Type of circularity attribute is: float64
Data Type of distance_circularity attribute is: float64
Data Type of radius_ratio attribute is: float64
Data Type of pr.axis_aspect_ratio attribute is: float64
Data Type of max.length_aspect_ratio attribute is: int64
Data Type of scatter_ratio attribute is: float64
Data Type of elongatedness attribute is: float64
Data Type of pr.axis_rectangularity attribute is: float64
Data Type of max.length_rectangularity attribute is: int64
Data Type of scaled_variance attribute is: float64
Data Type of scaled_variance.1 attribute is: float64
Data Type of scaled_radius_of_gyration attribute is: float64
Data Type of scaled_radius_of_gyration.1 attribute is: float64
Data Type of skewness_about attribute is: float64
Data Type of skewness_about.1 attribute is: float64
Data Type of skewness_about.2 attribute is: float64
Data Type of hollows_ratio attribute is: int64
Data Type of class attribute is: category
In [10]:
le = LabelEncoder() 
df['class'] = le.fit_transform(df['class'])
df['class']
Out[10]:
0      2
1      2
2      1
3      2
4      0
      ..
841    1
842    2
843    1
844    1
845    2
Name: class, Length: 846, dtype: int32
In [11]:
df['class'].value_counts()
Out[11]:
1    429
0    218
2    199
Name: class, dtype: int64
In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null int32
dtypes: float64(14), int32(1), int64(4)
memory usage: 122.4 KB
In [13]:
print('Checking the presence of missing values in the Data frame:\n')
null_value_count = df.isnull().sum() 

i = 0

#Length of the coulmns of the data fram
n=len(df.columns)

#List of all the attribues in the data frame 
List=list(df.columns.values)

while i < n:   
    New_List=List[i]
    print('There are',null_value_count[i],'null values in',New_List,'attribute in the dataframe')

    i=i+1
Checking the presence of missing values in the Data frame:

There are 0 null values in compactness attribute in the dataframe
There are 5 null values in circularity attribute in the dataframe
There are 4 null values in distance_circularity attribute in the dataframe
There are 6 null values in radius_ratio attribute in the dataframe
There are 2 null values in pr.axis_aspect_ratio attribute in the dataframe
There are 0 null values in max.length_aspect_ratio attribute in the dataframe
There are 1 null values in scatter_ratio attribute in the dataframe
There are 1 null values in elongatedness attribute in the dataframe
There are 3 null values in pr.axis_rectangularity attribute in the dataframe
There are 0 null values in max.length_rectangularity attribute in the dataframe
There are 3 null values in scaled_variance attribute in the dataframe
There are 2 null values in scaled_variance.1 attribute in the dataframe
There are 2 null values in scaled_radius_of_gyration attribute in the dataframe
There are 4 null values in scaled_radius_of_gyration.1 attribute in the dataframe
There are 6 null values in skewness_about attribute in the dataframe
There are 1 null values in skewness_about.1 attribute in the dataframe
There are 1 null values in skewness_about.2 attribute in the dataframe
There are 0 null values in hollows_ratio attribute in the dataframe
There are 0 null values in class attribute in the dataframe

There are few missing values in most of the attributes, we assume that these values are missed by random reason and replace the missing values with Median

In [14]:
#Stats of the dataframe
df.describe().T
Out[14]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
class 846.0 0.977541 0.702130 0.0 0.00 1.0 1.0 2.0
In [15]:
df=df.replace('', np.nan)
In [16]:
newdf=df.copy()
X = df.iloc[:,0:19] 
imputer = SimpleImputer(missing_values=np.nan, strategy='median', verbose=1)
transformed_values = imputer.fit_transform(X)
column = X.columns
print(column)
newdf = pd.DataFrame(transformed_values, columns = column )
newdf.describe().T
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')
Out[16]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.823877 6.134272 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.100473 15.741569 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.874704 33.401356 104.0 141.00 167.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.677305 7.882188 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.887707 33.197710 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.936170 7.811882 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.580378 2.588558 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.596927 31.360427 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.314421 176.496341 184.0 318.25 363.5 586.75 1018.0
scaled_radius_of_gyration 846.0 174.706856 32.546277 109.0 149.00 173.5 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.443262 7.468734 59.0 67.00 71.5 75.00 135.0
skewness_about 846.0 6.361702 4.903244 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.600473 8.930962 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.918440 6.152247 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0
class 846.0 0.977541 0.702130 0.0 0.00 1.0 1.00 2.0

We are adding a new dataframe newdf to store the value of df post missing values treatment

In [17]:
print('Checking the presence of missing values in the New Data frame:\n')
null_value_count = newdf.isnull().sum() 

i = 0

#Length of the coulmns of the data fram
n=len(newdf.columns)

#List of all the attribues in the data frame 
List=list(newdf.columns.values)

while i < n:   
    New_List=List[i]
    print('There are',null_value_count[i],'null values in',New_List,'attribute in the dataframe')

    i=i+1
Checking the presence of missing values in the New Data frame:

There are 0 null values in compactness attribute in the dataframe
There are 0 null values in circularity attribute in the dataframe
There are 0 null values in distance_circularity attribute in the dataframe
There are 0 null values in radius_ratio attribute in the dataframe
There are 0 null values in pr.axis_aspect_ratio attribute in the dataframe
There are 0 null values in max.length_aspect_ratio attribute in the dataframe
There are 0 null values in scatter_ratio attribute in the dataframe
There are 0 null values in elongatedness attribute in the dataframe
There are 0 null values in pr.axis_rectangularity attribute in the dataframe
There are 0 null values in max.length_rectangularity attribute in the dataframe
There are 0 null values in scaled_variance attribute in the dataframe
There are 0 null values in scaled_variance.1 attribute in the dataframe
There are 0 null values in scaled_radius_of_gyration attribute in the dataframe
There are 0 null values in scaled_radius_of_gyration.1 attribute in the dataframe
There are 0 null values in skewness_about attribute in the dataframe
There are 0 null values in skewness_about.1 attribute in the dataframe
There are 0 null values in skewness_about.2 attribute in the dataframe
There are 0 null values in hollows_ratio attribute in the dataframe
There are 0 null values in class attribute in the dataframe
In [18]:
print('Measure of skewness of Quantitative Data in the New Dataframe newdf')
i = 0
List=list(newdf.columns.values)
n=len(List)
while i < n:   
    New_List=List[i]
    skew=newdf[New_List].skew(axis = 0, skipna = True)
    if (skew==0):
        conclusion='Data is normally distributed or Symmetric'
    elif(skew<0):
        conclusion='Data is Left-Skewed'
    else:
        conclusion='Data is Right-Skewed'     
    print('Skewness of',New_List,'is: %.3f'%skew,'and',conclusion)
    i=i+1
Measure of skewness of Quantitative Data in the New Dataframe newdf
Skewness of compactness is: 0.381 and Data is Right-Skewed
Skewness of circularity is: 0.265 and Data is Right-Skewed
Skewness of distance_circularity is: 0.109 and Data is Right-Skewed
Skewness of radius_ratio is: 0.398 and Data is Right-Skewed
Skewness of pr.axis_aspect_ratio is: 3.835 and Data is Right-Skewed
Skewness of max.length_aspect_ratio is: 6.778 and Data is Right-Skewed
Skewness of scatter_ratio is: 0.609 and Data is Right-Skewed
Skewness of elongatedness is: 0.047 and Data is Right-Skewed
Skewness of pr.axis_rectangularity is: 0.774 and Data is Right-Skewed
Skewness of max.length_rectangularity is: 0.256 and Data is Right-Skewed
Skewness of scaled_variance is: 0.656 and Data is Right-Skewed
Skewness of scaled_variance.1 is: 0.845 and Data is Right-Skewed
Skewness of scaled_radius_of_gyration is: 0.280 and Data is Right-Skewed
Skewness of scaled_radius_of_gyration.1 is: 2.090 and Data is Right-Skewed
Skewness of skewness_about is: 0.781 and Data is Right-Skewed
Skewness of skewness_about.1 is: 0.689 and Data is Right-Skewed
Skewness of skewness_about.2 is: 0.250 and Data is Right-Skewed
Skewness of hollows_ratio is: -0.226 and Data is Left-Skewed
Skewness of class is: 0.031 and Data is Right-Skewed
In [19]:
print('Checking the presence of outliers of Quantitative Data in the New Dataframe newdf')
i = 0
total_outliers=0
List=list(newdf.columns.values)
n=len(List)
while i < n:   
    New_List=List[i]
    minimum,q1,q3,maximum= np.percentile(newdf[New_List],[0,25,75,100])
    iqr=q3-q1
    lower_value=q1-(1.5 * iqr)
    upper_value=q3+(1.5 * iqr)
    if ((minimum<lower_value) or (maximum>upper_value)):
        outliers = [x for x in df[New_List] if x < lower_value or x > upper_value]
        print('Identified outliers for',New_List,'out of', len(newdf[New_List]),'records: %d' % len(outliers))       
        total_outliers=total_outliers+len(outliers)
    else:
        print('There is no outlier for the attribute',New_List)        
    i=i+1
print('Total number of outliers are:',total_outliers)
Checking the presence of outliers of Quantitative Data in the New Dataframe newdf
There is no outlier for the attribute compactness
There is no outlier for the attribute circularity
There is no outlier for the attribute distance_circularity
Identified outliers for radius_ratio out of 846 records: 3
Identified outliers for pr.axis_aspect_ratio out of 846 records: 8
Identified outliers for max.length_aspect_ratio out of 846 records: 13
There is no outlier for the attribute scatter_ratio
There is no outlier for the attribute elongatedness
There is no outlier for the attribute pr.axis_rectangularity
There is no outlier for the attribute max.length_rectangularity
Identified outliers for scaled_variance out of 846 records: 1
Identified outliers for scaled_variance.1 out of 846 records: 2
There is no outlier for the attribute scaled_radius_of_gyration
Identified outliers for scaled_radius_of_gyration.1 out of 846 records: 15
Identified outliers for skewness_about out of 846 records: 12
Identified outliers for skewness_about.1 out of 846 records: 1
There is no outlier for the attribute skewness_about.2
There is no outlier for the attribute hollows_ratio
There is no outlier for the attribute class
Total number of outliers are: 55

The below columns have outliers.

radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1 skewness_about skewness_about.1

In [20]:
#Checking for the outliers using boxplot
i = 0
List=list(newdf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n: 
    New_List=List[i]
    plt.subplot(9,2,i+1)
    sns.boxplot(newdf[New_List])
    i=i+1 
plt.show()

Outlier Treatment using a new Data frame cleandf

In [21]:
Q1 = newdf.quantile(0.25)
Q3 = newdf.quantile(0.75)
IQR = Q3 - Q1
print('upper_value:\n',Q3+1.5*IQR)
print('\nLower value:\n',Q1-1.5*IQR)
cleandf = newdf[~((newdf < (Q1 - 1.5 * IQR)) |(newdf > (Q3 + 1.5 * IQR))).any(axis=1)]
upper_value:
 compactness                    119.500
circularity                     62.500
distance_circularity           140.000
radius_ratio                   276.000
pr.axis_aspect_ratio            77.000
max.length_aspect_ratio         14.500
scatter_ratio                  274.500
elongatedness                   65.500
pr.axis_rectangularity          29.000
max.length_rectangularity      192.000
scaled_variance                292.000
scaled_variance.1              989.500
scaled_radius_of_gyration      271.500
scaled_radius_of_gyration.1     87.000
skewness_about                  19.500
skewness_about.1                40.000
skewness_about.2               206.500
hollows_ratio                  217.125
class                            2.500
dtype: float64

Lower value:
 compactness                     67.500
circularity                     26.500
distance_circularity            28.000
radius_ratio                    60.000
pr.axis_aspect_ratio            45.000
max.length_aspect_ratio          2.500
scatter_ratio                   70.500
elongatedness                   13.500
pr.axis_rectangularity          13.000
max.length_rectangularity      104.000
scaled_variance                 92.000
scaled_variance.1              -84.500
scaled_radius_of_gyration       75.500
scaled_radius_of_gyration.1     55.000
skewness_about                  -8.500
skewness_about.1               -16.000
skewness_about.2               170.500
hollows_ratio                  174.125
class                           -1.500
dtype: float64
In [22]:
cleandf
Out[22]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95.0 48.0 83.0 178.0 72.0 10.0 162.0 42.0 20.0 159.0 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197.0 2.0
1 91.0 41.0 84.0 141.0 57.0 9.0 149.0 45.0 19.0 143.0 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199.0 2.0
2 104.0 50.0 106.0 209.0 66.0 10.0 207.0 32.0 23.0 158.0 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196.0 1.0
3 93.0 41.0 82.0 159.0 63.0 9.0 144.0 46.0 19.0 143.0 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207.0 2.0
5 107.0 44.0 106.0 172.0 50.0 6.0 255.0 26.0 28.0 169.0 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
841 93.0 39.0 87.0 183.0 64.0 8.0 169.0 40.0 20.0 134.0 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195.0 1.0
842 89.0 46.0 84.0 163.0 66.0 11.0 159.0 43.0 20.0 159.0 173.0 368.0 176.0 72.0 1.0 20.0 186.0 197.0 2.0
843 106.0 54.0 101.0 222.0 67.0 12.0 222.0 30.0 25.0 173.0 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201.0 1.0
844 86.0 36.0 78.0 146.0 58.0 7.0 135.0 50.0 18.0 124.0 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195.0 1.0
845 85.0 36.0 66.0 123.0 55.0 5.0 120.0 56.0 17.0 128.0 140.0 212.0 131.0 73.0 1.0 18.0 186.0 190.0 2.0

813 rows × 19 columns

In [23]:
print('Checking the presence of outliers of Quantitative Data in the clean Dataframe post outlier treatment')
i = 0
total_outliers=0
List=list(cleandf.columns.values)
n=len(List) 
while i < n:   
    New_List=List[i]
    minimum,q1,q3,maximum= np.percentile(cleandf[New_List],[0,25,75,100])
    iqr=q3-q1
    lower_value=q1-(1.5 * iqr)
    upper_value=q3+(1.5 * iqr)
    if ((minimum<lower_value) or (maximum>upper_value)):
        outliers = [x for x in cleandf[New_List] if x < lower_value or x > upper_value]
        print('Identified outliers for',New_List,'out of', len(cleandf[New_List]),'records: %d' % len(outliers))       
        total_outliers=total_outliers+len(outliers)
    else:
        print('There is no outlier for the attribute',New_List)        
    i=i+1
print('Total number of outliers are:',total_outliers)
Checking the presence of outliers of Quantitative Data in the clean Dataframe post outlier treatment
There is no outlier for the attribute compactness
There is no outlier for the attribute circularity
There is no outlier for the attribute distance_circularity
There is no outlier for the attribute radius_ratio
There is no outlier for the attribute pr.axis_aspect_ratio
There is no outlier for the attribute max.length_aspect_ratio
There is no outlier for the attribute scatter_ratio
There is no outlier for the attribute elongatedness
There is no outlier for the attribute pr.axis_rectangularity
There is no outlier for the attribute max.length_rectangularity
There is no outlier for the attribute scaled_variance
Identified outliers for scaled_variance.1 out of 813 records: 1
There is no outlier for the attribute scaled_radius_of_gyration
There is no outlier for the attribute scaled_radius_of_gyration.1
There is no outlier for the attribute skewness_about
There is no outlier for the attribute skewness_about.1
There is no outlier for the attribute skewness_about.2
There is no outlier for the attribute hollows_ratio
There is no outlier for the attribute class
Total number of outliers are: 1

Most of the Outlier are removed, and the one outlier which is availabe in the scaled_variance.1 is from the previous data, which was not an outlier earlier and can be ignored

In [24]:
#Checking for the outliers in cleandf using boxplot
i = 0
List=list(cleandf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n: 
    New_List=List[i]
    plt.subplot(9,2,i+1)
    sns.boxplot(cleandf[New_List])
    i=i+1 
plt.show()

We can proceed with either the newdf with Outlier or the cleandf in which we have treated the outliers. The outliers are very few and they are not so unrealistic. Hence, we need not remove them since the prediction model should represent the real world. This improves the generalizability of the model and makes it robust for real world situations. The outliers, therefore, are not removed and we will proceed with newdf.

2. Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why (10 points)

In [25]:
# Histogram Plot of Quantitative Data
i = 0
List=list(newdf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n: 
    New_List=List[i]
    plt.subplot(9,2,i+1)
    plt.hist(newdf[New_List],edgecolor = 'black')
    plt.xlabel(New_List)
    i=i+1
plt.show() 

Quick Observation :

Most of the data attributes seems to be normally distributed
Few Many of the attributes are Right Skewed as mentioned earlier in skewness check.

In [26]:
# Strip Plot of Quantitative Data
i = 0
List=list(newdf.iloc[:,0:18].columns.values)
n=len(List)
plt.figure(figsize= (12,30))
while i<n: 
    New_List=List[i]
    plt.subplot(9, 2, i+1)
    sns.stripplot(df['class'],df[New_List])    
    plt.xlabel(New_List)
    i=i+1
plt.show()
In [27]:
#We will use Pearson Correlation Coefficient to see what all attributes are linearly related 
newdf.corr()
Out[27]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
compactness 1.000000 0.684887 0.789928 0.689743 0.091534 0.148249 0.812620 -0.788750 0.813694 0.676143 0.762070 0.814012 0.585243 -0.249593 0.236078 0.157015 0.298537 0.365552 -0.033796
circularity 0.684887 1.000000 0.792320 0.620912 0.153778 0.251467 0.847938 -0.821472 0.843400 0.961318 0.796306 0.835946 0.925816 0.051946 0.144198 -0.011439 -0.104426 0.046351 -0.158910
distance_circularity 0.789928 0.792320 1.000000 0.767035 0.158456 0.264686 0.905076 -0.911307 0.893025 0.774527 0.861519 0.886017 0.705771 -0.225944 0.113924 0.265547 0.146098 0.332732 -0.064467
radius_ratio 0.689743 0.620912 0.767035 1.000000 0.663447 0.450052 0.734429 -0.789481 0.708385 0.568949 0.793415 0.718436 0.536372 -0.180397 0.048713 0.173741 0.382214 0.471309 -0.182186
pr.axis_aspect_ratio 0.091534 0.153778 0.158456 0.663447 1.000000 0.648724 0.103732 -0.183035 0.079604 0.126909 0.272910 0.089189 0.121971 0.152950 -0.058371 -0.031976 0.239886 0.267725 -0.098178
max.length_aspect_ratio 0.148249 0.251467 0.264686 0.450052 0.648724 1.000000 0.166191 -0.180140 0.161502 0.305943 0.318957 0.143253 0.189743 0.295735 0.015599 0.043422 -0.026081 0.143919 0.207619
scatter_ratio 0.812620 0.847938 0.905076 0.734429 0.103732 0.166191 1.000000 -0.971601 0.989751 0.809083 0.948662 0.993012 0.799875 -0.027542 0.074458 0.212428 0.005628 0.118817 -0.288895
elongatedness -0.788750 -0.821472 -0.911307 -0.789481 -0.183035 -0.180140 -0.971601 1.000000 -0.948996 -0.775854 -0.936382 -0.953816 -0.766314 0.103302 -0.052600 -0.185053 -0.115126 -0.216905 0.339344
pr.axis_rectangularity 0.813694 0.843400 0.893025 0.708385 0.079604 0.161502 0.989751 -0.948996 1.000000 0.810934 0.934227 0.988213 0.796690 -0.015495 0.083767 0.214700 -0.018649 0.099286 -0.258481
max.length_rectangularity 0.676143 0.961318 0.774527 0.568949 0.126909 0.305943 0.809083 -0.775854 0.810934 1.000000 0.744985 0.794615 0.866450 0.041622 0.135852 0.001366 -0.103948 0.076770 -0.032399
scaled_variance 0.762070 0.796306 0.861519 0.793415 0.272910 0.318957 0.948662 -0.936382 0.934227 0.744985 1.000000 0.945678 0.778917 0.113078 0.036729 0.194239 0.014219 0.085695 -0.312943
scaled_variance.1 0.814012 0.835946 0.886017 0.718436 0.089189 0.143253 0.993012 -0.953816 0.988213 0.794615 0.945678 1.000000 0.795017 -0.015401 0.076877 0.200811 0.006219 0.102935 -0.288115
scaled_radius_of_gyration 0.585243 0.925816 0.705771 0.536372 0.121971 0.189743 0.799875 -0.766314 0.796690 0.866450 0.778917 0.795017 1.000000 0.191473 0.166483 -0.056153 -0.224450 -0.118002 -0.250267
scaled_radius_of_gyration.1 -0.249593 0.051946 -0.225944 -0.180397 0.152950 0.295735 -0.027542 0.103302 -0.015495 0.041622 0.113078 -0.015401 0.191473 1.000000 -0.088355 -0.126183 -0.748865 -0.802123 -0.212601
skewness_about 0.236078 0.144198 0.113924 0.048713 -0.058371 0.015599 0.074458 -0.052600 0.083767 0.135852 0.036729 0.076877 0.166483 -0.088355 1.000000 -0.034990 0.115297 0.097126 0.119581
skewness_about.1 0.157015 -0.011439 0.265547 0.173741 -0.031976 0.043422 0.212428 -0.185053 0.214700 0.001366 0.194239 0.200811 -0.056153 -0.126183 -0.034990 1.000000 0.077310 0.204990 -0.010680
skewness_about.2 0.298537 -0.104426 0.146098 0.382214 0.239886 -0.026081 0.005628 -0.115126 -0.018649 -0.103948 0.014219 0.006219 -0.224450 -0.748865 0.115297 0.077310 1.000000 0.892581 0.067244
hollows_ratio 0.365552 0.046351 0.332732 0.471309 0.267725 0.143919 0.118817 -0.216905 0.099286 0.076770 0.085695 0.102935 -0.118002 -0.802123 0.097126 0.204990 0.892581 1.000000 0.235874
class -0.033796 -0.158910 -0.064467 -0.182186 -0.098178 0.207619 -0.288895 0.339344 -0.258481 -0.032399 -0.312943 -0.288115 -0.250267 -0.212601 0.119581 -0.010680 0.067244 0.235874 1.000000
In [28]:
plt.figure(figsize=(15,15))
sns.heatmap(newdf.corr(),annot=True,square=True,fmt='.2f')
plt.show()
In [29]:
print('Insights From Correlation Heatmap:\n');
print('Attributes with high correlation of greater than 0.9 or less than 0.9:\n');
a=newdf.corr()
i = 0
j = 0
c = 0
n=len(a.columns)
col=list(a.columns.values)
ind=list(a.index.values)
while i < n:
    sInd=ind[i]
    while j < n:
        sCol=col[j]
        value=a.loc[sInd,sCol]
        if(((value>0.9) or (value<-0.9))&(sInd!=sCol)):
            print('Correlation between',sInd,'&',sCol,'is',a.loc[sInd,sCol])
        j=j+1
    c=c+1
    j = c
    i=i+1
Insights From Correlation Heatmap:

Attributes with high correlation of greater than 0.9 or less than 0.9:

Correlation between circularity & max.length_rectangularity is 0.9613180653243628
Correlation between circularity & scaled_radius_of_gyration is 0.9258160243502346
Correlation between distance_circularity & scatter_ratio is 0.9050757734130161
Correlation between distance_circularity & elongatedness is -0.91130693226804
Correlation between scatter_ratio & elongatedness is -0.9716008640363396
Correlation between scatter_ratio & pr.axis_rectangularity is 0.9897505102299368
Correlation between scatter_ratio & scaled_variance is 0.948662306793849
Correlation between scatter_ratio & scaled_variance.1 is 0.9930115357055442
Correlation between elongatedness & pr.axis_rectangularity is -0.9489958637861852
Correlation between elongatedness & scaled_variance is -0.9363818383742168
Correlation between elongatedness & scaled_variance.1 is -0.9538160951670767
Correlation between pr.axis_rectangularity & scaled_variance is 0.934227018947562
Correlation between pr.axis_rectangularity & scaled_variance.1 is 0.9882131554653995
Correlation between scaled_variance & scaled_variance.1 is 0.9456775219969437
In [30]:
print('Insights From Correlation Heatmap:\n');
print('Attributes with low correlation of -0.3 TO 0.3:\n');
a=newdf.corr()
i = 0
c = 0
j = 0
n=len(a.columns)
col=list(a.columns.values)
ind=list(a.index.values)
while i < n:  
    sInd=ind[i]
    while j < n:
        sCol=col[j]
        value=a.loc[sInd,sCol]
        if((value<.3)&(sInd!=sCol)&(value>-0.3)):
            print('Correlation between',sInd,'&',sCol,'is',a.loc[sInd,sCol])
        j = j+1 
    c = c+1
    j = c
    i = i+1
Insights From Correlation Heatmap:

Attributes with low correlation of -0.3 TO 0.3:

Correlation between compactness & pr.axis_aspect_ratio is 0.09153432870639264
Correlation between compactness & max.length_aspect_ratio is 0.14824918609375565
Correlation between compactness & scaled_radius_of_gyration.1 is -0.24959256235551516
Correlation between compactness & skewness_about is 0.23607838250749777
Correlation between compactness & skewness_about.1 is 0.1570146241186882
Correlation between compactness & skewness_about.2 is 0.29853704522125996
Correlation between compactness & class is -0.033795598595698716
Correlation between circularity & pr.axis_aspect_ratio is 0.1537782392464515
Correlation between circularity & max.length_aspect_ratio is 0.2514667821406536
Correlation between circularity & scaled_radius_of_gyration.1 is 0.05194637831300446
Correlation between circularity & skewness_about is 0.14419763228132643
Correlation between circularity & skewness_about.1 is -0.01143858188971866
Correlation between circularity & skewness_about.2 is -0.1044264643461786
Correlation between circularity & hollows_ratio is 0.04635076561078029
Correlation between circularity & class is -0.15890986607266508
Correlation between distance_circularity & pr.axis_aspect_ratio is 0.15845567255572948
Correlation between distance_circularity & max.length_aspect_ratio is 0.2646863352835796
Correlation between distance_circularity & scaled_radius_of_gyration.1 is -0.22594376356939186
Correlation between distance_circularity & skewness_about is 0.11392408296382121
Correlation between distance_circularity & skewness_about.1 is 0.2655466197670097
Correlation between distance_circularity & skewness_about.2 is 0.1460982321315774
Correlation between distance_circularity & class is -0.06446738706597277
Correlation between radius_ratio & scaled_radius_of_gyration.1 is -0.1803973511353748
Correlation between radius_ratio & skewness_about is 0.04871267587152077
Correlation between radius_ratio & skewness_about.1 is 0.17374087994085555
Correlation between radius_ratio & class is -0.18218592973588704
Correlation between pr.axis_aspect_ratio & scatter_ratio is 0.10373196114802682
Correlation between pr.axis_aspect_ratio & elongatedness is -0.18303494751584753
Correlation between pr.axis_aspect_ratio & pr.axis_rectangularity is 0.07960365940337463
Correlation between pr.axis_aspect_ratio & max.length_rectangularity is 0.12690920740527828
Correlation between pr.axis_aspect_ratio & scaled_variance is 0.27291008858392113
Correlation between pr.axis_aspect_ratio & scaled_variance.1 is 0.08918872159881902
Correlation between pr.axis_aspect_ratio & scaled_radius_of_gyration is 0.12197089207975687
Correlation between pr.axis_aspect_ratio & scaled_radius_of_gyration.1 is 0.15294990067392705
Correlation between pr.axis_aspect_ratio & skewness_about is -0.05837059554151864
Correlation between pr.axis_aspect_ratio & skewness_about.1 is -0.031976062919945016
Correlation between pr.axis_aspect_ratio & skewness_about.2 is 0.239885789108611
Correlation between pr.axis_aspect_ratio & hollows_ratio is 0.2677252432869659
Correlation between pr.axis_aspect_ratio & class is -0.09817840759918779
Correlation between max.length_aspect_ratio & scatter_ratio is 0.16619119462406198
Correlation between max.length_aspect_ratio & elongatedness is -0.1801400738034967
Correlation between max.length_aspect_ratio & pr.axis_rectangularity is 0.16150199705886506
Correlation between max.length_aspect_ratio & scaled_variance.1 is 0.14325316507906175
Correlation between max.length_aspect_ratio & scaled_radius_of_gyration is 0.18974277204226994
Correlation between max.length_aspect_ratio & scaled_radius_of_gyration.1 is 0.2957346517280677
Correlation between max.length_aspect_ratio & skewness_about is 0.015599232976104318
Correlation between max.length_aspect_ratio & skewness_about.1 is 0.04342185184103971
Correlation between max.length_aspect_ratio & skewness_about.2 is -0.02608061059485162
Correlation between max.length_aspect_ratio & hollows_ratio is 0.14391872794302824
Correlation between max.length_aspect_ratio & class is 0.20761937497926236
Correlation between scatter_ratio & scaled_radius_of_gyration.1 is -0.027541865538587555
Correlation between scatter_ratio & skewness_about is 0.07445766202724009
Correlation between scatter_ratio & skewness_about.1 is 0.2124281930624041
Correlation between scatter_ratio & skewness_about.2 is 0.0056277294762155676
Correlation between scatter_ratio & hollows_ratio is 0.11881748927511104
Correlation between scatter_ratio & class is -0.2888951615185675
Correlation between elongatedness & scaled_radius_of_gyration.1 is 0.10330202814036216
Correlation between elongatedness & skewness_about is -0.05259968010681245
Correlation between elongatedness & skewness_about.1 is -0.18505343989895195
Correlation between elongatedness & skewness_about.2 is -0.11512588695649002
Correlation between elongatedness & hollows_ratio is -0.21690480524139333
Correlation between pr.axis_rectangularity & scaled_radius_of_gyration.1 is -0.015495384555020388
Correlation between pr.axis_rectangularity & skewness_about is 0.08376714782182272
Correlation between pr.axis_rectangularity & skewness_about.1 is 0.21470045390199183
Correlation between pr.axis_rectangularity & skewness_about.2 is -0.01864857177283112
Correlation between pr.axis_rectangularity & hollows_ratio is 0.09928622370701838
Correlation between pr.axis_rectangularity & class is -0.25848110987663037
Correlation between max.length_rectangularity & scaled_radius_of_gyration.1 is 0.04162172926560247
Correlation between max.length_rectangularity & skewness_about is 0.1358515409521967
Correlation between max.length_rectangularity & skewness_about.1 is 0.0013656565108511377
Correlation between max.length_rectangularity & skewness_about.2 is -0.10394774781346748
Correlation between max.length_rectangularity & hollows_ratio is 0.07676961657734818
Correlation between max.length_rectangularity & class is -0.03239877304458095
Correlation between scaled_variance & scaled_radius_of_gyration.1 is 0.11307780859943668
Correlation between scaled_variance & skewness_about is 0.03672901373612493
Correlation between scaled_variance & skewness_about.1 is 0.1942385049619721
Correlation between scaled_variance & skewness_about.2 is 0.014219233677517507
Correlation between scaled_variance & hollows_ratio is 0.08569514618457441
Correlation between scaled_variance.1 & scaled_radius_of_gyration.1 is -0.015400553708787179
Correlation between scaled_variance.1 & skewness_about is 0.07687725391960785
Correlation between scaled_variance.1 & skewness_about.1 is 0.20081053290491052
Correlation between scaled_variance.1 & skewness_about.2 is 0.006218998091687836
Correlation between scaled_variance.1 & hollows_ratio is 0.10293532345207396
Correlation between scaled_variance.1 & class is -0.28811503056442306
Correlation between scaled_radius_of_gyration & scaled_radius_of_gyration.1 is 0.19147281500281518
Correlation between scaled_radius_of_gyration & skewness_about is 0.16648269330912877
Correlation between scaled_radius_of_gyration & skewness_about.1 is -0.05615308122611754
Correlation between scaled_radius_of_gyration & skewness_about.2 is -0.22445021305001703
Correlation between scaled_radius_of_gyration & hollows_ratio is -0.11800177191536232
Correlation between scaled_radius_of_gyration & class is -0.250266521291379
Correlation between scaled_radius_of_gyration.1 & skewness_about is -0.08835544532565248
Correlation between scaled_radius_of_gyration.1 & skewness_about.1 is -0.12618294073751143
Correlation between scaled_radius_of_gyration.1 & class is -0.2126012565690494
Correlation between skewness_about & skewness_about.1 is -0.03499014271203469
Correlation between skewness_about & skewness_about.2 is 0.1152973554955333
Correlation between skewness_about & hollows_ratio is 0.09712584441905797
Correlation between skewness_about & class is 0.11958105104950455
Correlation between skewness_about.1 & skewness_about.2 is 0.07731025073053145
Correlation between skewness_about.1 & hollows_ratio is 0.20498997534082655
Correlation between skewness_about.1 & class is -0.010680096830285956
Correlation between skewness_about.2 & class is 0.06724435446714724
Correlation between hollows_ratio & class is 0.23587414847233593

If two features is highly correlated then there is no point using both features.in that case, we can drop one feature. SNS heatmap gives us the correlation matrix where we can see which features are highly correlated.

From above correlation matrix we can see that there are many features which are highly correlated. if we carefully analyse, we will find that many features are there which having more than 0.9 correlation. so we can decide to get rid of those columns whose correlation is +-0.9 or above.

In [31]:
sns.pairplot(newdf,hue='class' ,diag_kind="kde")
plt.show()
C:\Users\Krish\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:487: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
C:\Users\Krish\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2

Observations:

We can see the similar observation from the Correlation heat map, that there are few Attributes which are Highly Positively/Negatively correlated

3. Split the data into train and test (Suggestion: specify “random state” if you are using train_test_split from Sklearn) (5 marks)

As per the above correlation values, from the attribute which has high correlation we can drop any one of the atribute, and as per analysis we can DROP the below mentioned attributes:

max.length_rectangularity
scaled_radius_of_gyration
distance_circularity
elongatedness
pr.axis_rectangularity
scaled_variance
scaled_variance.1

First let us train the model with the RAW Data and then use the PCA to decide on the dimentionality reduction

In [108]:
#Split our data into train and test data set
seed=12
x=newdf.drop(['class'],axis=1)
y=newdf['class']
x_train_df,x_test_df,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=seed)
In [109]:
x_train_df
Out[109]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
832 108.0 49.0 109.0 204.0 61.0 11.0 212.0 31.0 24.0 159.0 229.0 665.0 215.0 71.0 16.0 11.0 190.0 199.0
40 95.0 48.0 104.0 214.0 67.0 9.0 205.0 32.0 23.0 151.0 227.0 628.0 202.0 74.0 5.0 9.0 186.0 193.0
460 90.0 41.0 62.0 147.0 60.0 6.0 128.0 52.0 18.0 141.0 149.0 246.0 157.0 61.0 13.0 4.0 201.0 208.0
113 88.0 35.0 50.0 121.0 58.0 5.0 114.0 59.0 17.0 122.0 132.0 192.0 138.0 74.0 21.0 4.0 182.0 187.0
822 95.0 41.0 82.0 170.0 65.0 9.0 145.0 46.0 19.0 145.0 163.0 314.0 140.0 64.0 4.0 8.0 199.0 207.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
241 93.0 45.0 73.0 164.0 59.0 7.0 159.0 42.0 20.0 146.0 182.0 379.0 188.0 65.0 11.0 15.0 195.0 201.0
253 94.0 43.0 68.0 170.0 67.0 6.0 142.0 46.0 18.0 142.0 164.0 310.0 177.0 65.0 10.0 8.0 198.0 203.0
390 86.0 42.0 65.0 113.0 50.0 8.0 152.0 45.0 19.0 141.0 169.0 332.0 171.0 85.0 4.0 16.0 179.0 183.0
667 110.0 53.0 104.0 223.0 66.0 10.0 211.0 32.0 24.0 164.0 223.0 659.0 210.0 67.0 5.0 16.0 190.0 203.0
843 106.0 54.0 101.0 222.0 67.0 12.0 222.0 30.0 25.0 173.0 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201.0

592 rows × 18 columns

In [110]:
y_train
Out[110]:
832    1.0
40     1.0
460    2.0
113    1.0
822    2.0
      ... 
241    1.0
253    0.0
390    0.0
667    1.0
843    1.0
Name: class, Length: 592, dtype: float64
In [111]:
#Checking the split of the data
print("{0:0.2f}% data is in training set".format((len(x_train_df)/len(newdf)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test_df)/len(newdf)) * 100))
69.98% data is in training set
30.02% data is in test set
In [112]:
print("Original class bus Values    : {0} ({1:0.2f}%)".format(len(newdf.loc[newdf['class'] == 0]), (len(newdf.loc[newdf['class'] == 0])/len(newdf)) * 100))
print("Original class car Values   : {0} ({1:0.2f}%)".format(len(newdf.loc[newdf['class'] == 1]), (len(newdf.loc[newdf['class'] == 1])/len(newdf)) * 100))
print("Original class van Values   : {0} ({1:0.2f}%)".format(len(newdf.loc[newdf['class'] == 2]), (len(newdf.loc[newdf['class'] == 2])/len(newdf)) * 100))
print("")
print("Training class bus Values    : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train == 0])/len(y_train)) * 100))
print("Training class car Values   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train == 1])/len(y_train)) * 100))
print("Training class van Values   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 2]), (len(y_train[y_train == 2])/len(y_train)) * 100))
print("")
print("Test class bus Values        : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test == 0])/len(y_test)) * 100))
print("Test class car Values       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test == 1])/len(y_test)) * 100))
print("Test class van Values       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 2]), (len(y_test[y_test == 2])/len(y_test)) * 100))
print("")
Original class bus Values    : 218 (25.77%)
Original class car Values   : 429 (50.71%)
Original class van Values   : 199 (23.52%)

Training class bus Values    : 157 (26.52%)
Training class car Values   : 299 (50.51%)
Training class van Values   : 136 (22.97%)

Test class bus Values        : 61 (24.02%)
Test class car Values       : 130 (51.18%)
Test class van Values       : 63 (24.80%)

In [113]:
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(x_train_df)
x_train = pd.DataFrame(scaled_df)
In [114]:
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(x_test_df)
x_test = pd.DataFrame(scaled_df)

4. Train a Support vector machine using the train set and get the accuracy on the test set (10 marks)

4a. Linear Support Vector Machine

In [115]:
#Linear Support vector Machine
lsvm = SVC(kernel='linear',random_state=seed)
lsvm.fit(x_train, y_train)
Out[115]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=12,
    shrinking=True, tol=0.001, verbose=False)
In [116]:
print('Train Data Score :',np.round(lsvm.score(x_train, y_train),4))
print('Test Data Score :',np.round(lsvm.score(x_test, y_test),4))
Train Data Score : 0.9713
Test Data Score : 0.9291
In [117]:
#Predict for train set
pred_train = lsvm.predict(x_train)

#Confusion Matrix
lsvm_cm_train = pd.DataFrame(confusion_matrix(y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
lsvm_cm_train.index.name = "Predicted"
lsvm_cm_train.columns.name = "True"
lsvm_cm_train
Out[117]:
True Bus Car Van
Predicted
Bus 152 7 0
Car 5 289 2
Van 0 3 134
In [118]:
#Predict for test set
pred_test = lsvm.predict(x_test)

#Confusion Matrix
lsvm_cm_test = pd.DataFrame(confusion_matrix(y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
lsvm_cm_test.index.name = "Predicted"
lsvm_cm_test.columns.name = "True"
lsvm_cm_test
Out[118]:
True Bus Car Van
Predicted
Bus 56 9 2
Car 1 121 2
Van 4 0 59
In [119]:
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Linear Support vector machine for test data: \n")
ax=sns.heatmap(lsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
In [120]:
#summarize the fit of the model
lsvm_accuracy = np.round( metrics.accuracy_score( y_test, pred_test ), 4 )
#lsvm_precision = np.round( metrics.precision_score( y_test, pred_test ,average=None), 4 )
#lsvm_recall    = np.round( metrics.recall_score( y_test, pred_test,average=None ), 4 )
#lsvm_f1score = np.round( metrics.f1_score( y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', lsvm_accuracy)
print('\n') 
print('Metrics Classification Report for linear Support vector machine regression\n',metrics.classification_report(y_test, pred_test))
Total Accuracy :  0.9291


Metrics Classification Report for linear Support vector machine regression
               precision    recall  f1-score   support

         0.0       0.84      0.92      0.88        61
         1.0       0.98      0.93      0.95       130
         2.0       0.94      0.94      0.94        63

    accuracy                           0.93       254
   macro avg       0.92      0.93      0.92       254
weighted avg       0.93      0.93      0.93       254

4b. Poly Support Vector Machine

In [121]:
psvm = SVC(kernel='poly',random_state=seed,gamma='scale')
psvm.fit(x_train, y_train)
Out[121]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='poly',
    max_iter=-1, probability=False, random_state=12, shrinking=True, tol=0.001,
    verbose=False)
In [122]:
#Poly Support Vector machine
print('Train Data Score :',np.round(psvm.score(x_train, y_train),4))
print('Test Data Score :',np.round(psvm.score(x_test, y_test),4))
Train Data Score : 0.8345
Test Data Score : 0.8228
In [123]:
#Predict for train set
pred_train = psvm.predict(x_train)

#Confusion Matrix
psvm_cm_train = pd.DataFrame(confusion_matrix(y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
psvm_cm_train.index.name = "Predicted"
psvm_cm_train.columns.name = "True"
psvm_cm_train
Out[123]:
True Bus Car Van
Predicted
Bus 111 1 0
Car 46 295 48
Van 0 3 88
In [124]:
#Predict for test set
pred_test = psvm.predict(x_test)

#Confusion Matrix
psvm_cm_test = pd.DataFrame(confusion_matrix(y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
psvm_cm_test.index.name = "Predicted"
psvm_cm_test.columns.name = "True"
psvm_cm_test
Out[124]:
True Bus Car Van
Predicted
Bus 43 3 1
Car 17 127 23
Van 1 0 39
In [125]:
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for poly Support vector machine for test data: \n")
ax=sns.heatmap(psvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
In [126]:
#summarize the fit of the model
psvm_accuracy = np.round( metrics.accuracy_score( y_test, pred_test ), 4 )
#psvm_precision = np.round( metrics.precision_score( y_test, pred_test ,average=None), 4 )
#psvm_recall    = np.round( metrics.recall_score( y_test, pred_test,average=None ), 4 )
#psvm_f1score = np.round( metrics.f1_score( y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', psvm_accuracy)
print('\n') 
print('Metrics Classification Report for poly Support vector machine regression\n',metrics.classification_report(y_test, pred_test))
Total Accuracy :  0.8228


Metrics Classification Report for poly Support vector machine regression
               precision    recall  f1-score   support

         0.0       0.91      0.70      0.80        61
         1.0       0.76      0.98      0.86       130
         2.0       0.97      0.62      0.76        63

    accuracy                           0.82       254
   macro avg       0.88      0.77      0.80       254
weighted avg       0.85      0.82      0.82       254

4c. Radial basis function Support vector machine

In [127]:
rsvm = SVC(kernel='rbf',random_state=seed,gamma='scale')
rsvm.fit(x_train, y_train)
Out[127]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=12, shrinking=True, tol=0.001,
    verbose=False)
In [128]:
#rbf Support Vector machine
print('Train Data Score :',np.round(rsvm.score(x_train, y_train),4))
print('Test Data Score :',np.round(rsvm.score(x_test, y_test),4))
Train Data Score : 0.978
Test Data Score : 0.9449
In [129]:
#Predict for train set
pred_train = rsvm.predict(x_train)

#Confusion Matrix
rsvm_cm_train = pd.DataFrame(confusion_matrix(y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
rsvm_cm_train.index.name = "Predicted"
rsvm_cm_train.columns.name = "True"
rsvm_cm_train
Out[129]:
True Bus Car Van
Predicted
Bus 156 0 0
Car 0 294 7
Van 1 5 129
In [130]:
#Predict for test set
pred_test = rsvm.predict(x_test)

#Confusion Matrix
rsvm_cm_test = pd.DataFrame(confusion_matrix(y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
rsvm_cm_test.index.name = "Predicted"
rsvm_cm_test.columns.name = "True"
rsvm_cm_test
Out[130]:
True Bus Car Van
Predicted
Bus 58 5 1
Car 0 124 4
Van 3 1 58
In [131]:
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Radial basis function Support vector machine for test data: \n")
ax=sns.heatmap(rsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
In [132]:
#summarize the fit of the model
rsvm_accuracy = np.round( metrics.accuracy_score( y_test, pred_test ), 4 )
#rsvm_precision = np.round( metrics.precision_score( y_test, pred_test ,average=None), 4 )
#rsvm_recall    = np.round( metrics.recall_score( y_test, pred_test,average=None ), 4 )
#rsvm_f1score = np.round( metrics.f1_score( y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', rsvm_accuracy)
print('\n') 
print('Metrics Classification Report for Radial basis function Support vector machine regression\n',metrics.classification_report(y_test, pred_test))
Total Accuracy :  0.9449


Metrics Classification Report for Radial basis function Support vector machine regression
               precision    recall  f1-score   support

         0.0       0.91      0.95      0.93        61
         1.0       0.97      0.95      0.96       130
         2.0       0.94      0.92      0.93        63

    accuracy                           0.94       254
   macro avg       0.94      0.94      0.94       254
weighted avg       0.95      0.94      0.95       254

4d. Comparing SVM Accuracy scores for different Kernel

In [133]:
SVMresult_Before_PCA = pd.DataFrame({'Model' : ['SVM Linear', 'SVM Polynomial', 'SVM RBF'], 
                       'Model Accuracy Before PCA' : [lsvm_accuracy, psvm_accuracy, rsvm_accuracy],
                      })
SVMresult_Before_PCA
Out[133]:
Model Model Accuracy Before PCA
0 SVM Linear 0.9291
1 SVM Polynomial 0.8228
2 SVM RBF 0.9449

Insights:
From the above results, Radial basis function SVM Model trained using the RAW Data gives the higher Accuracy when compared to the Linear and the Polynomial SVM Model

5. Perform K-fold cross validation and get the cross validation score of the model (optional)

In [134]:
#K fold Cross validation using the K- Fold value as 10, in the Linear SVM Model
kf=KFold(n_splits= 10, random_state = seed)
lsvm_results = cross_val_score(estimator = lsvm, X = x_train, y = y_train, cv = kf)
lsvm_kf_accuracy=lsvm_results.mean()
print(lsvm_kf_accuracy)
0.9594350282485875
In [135]:
#K fold Cross validation in the Polynomial SVM Model
psvm_results = cross_val_score(estimator = psvm, X = x_train, y = y_train, cv = kf)
psvm_kf_accuracy=psvm_results.mean()
print(psvm_kf_accuracy)
0.7887853107344631
In [136]:
#K fold Cross validation in the RBF SVM Model
rsvm_results = cross_val_score(estimator = rsvm, X = x_train, y = y_train, cv = kf)
rsvm_kf_accuracy=rsvm_results.mean()
print(rsvm_kf_accuracy)
0.9644632768361581
In [137]:
Cross_Validation_Score_Before_PCA = pd.DataFrame({'Model' : ['SVM Linear KF', 'SVM Polynomial KF', 'SVM RBF KF'], 
                       ' Cross validation Score Before PCA' : [lsvm_kf_accuracy, psvm_kf_accuracy, rsvm_kf_accuracy],
                      })
Cross_Validation_Score_Before_PCA
Out[137]:
Model Cross validation Score Before PCA
0 SVM Linear KF 0.959435
1 SVM Polynomial KF 0.788785
2 SVM RBF KF 0.964463

Insights:
From the above results, Radial basis function SVM Model trained using the RAW Data gives the highest Average Accuracy in the K-Fold Validation when compared to the Linear and the Polynomial SVM Model

6. Use PCA from Scikit learn, extract Principal Components that capture about 95% of the variance in the data – (10 points)

We will perform PCA in the following steps:

Split our data into train and test data set
normalize the training set using standard scalar
Calculate the covariance matrix.
Calculate the eigenvectors and their eigenvalues.
Sort the eigenvectors according to their eigenvalues in descending order.
Choose the first K eigenvectors (where k is the dimension we'd like to end up with).
Build new dataset with reduced dimensionality.

In [138]:
#Split our data into train and test data set
x_PCA=newdf.drop(['class'],axis=1)
y_PCA=newdf['class']
x_train_df_PCA,x_test_df_PCA,y_train_PCA,y_test_PCA=train_test_split(x_PCA,y_PCA,test_size=0.3,random_state=seed)
In [139]:
scaler = preprocessing.StandardScaler()
scaled_df_PCA = scaler.fit_transform(x_train_df_PCA)
x_train_PCA = pd.DataFrame(scaled_df_PCA)

scaler = preprocessing.StandardScaler()
scaled_df_PCA = scaler.fit_transform(x_test_df_PCA)
x_test_PCA = pd.DataFrame(scaled_df_PCA)
In [140]:
shape=x_train_PCA.shape  #Provides the Shape in (Rows, Columns) in the Data Frame df
print('shape of the data frame is =',shape)
shape of the data frame is = (592, 18)
In [141]:
#Checking the split of the data
print("{0:0.2f}% data is in training set".format((len(x_train_PCA)/len(newdf)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test_PCA)/len(newdf)) * 100))
69.98% data is in training set
30.02% data is in test set
In [142]:
#Calculation of CovMatrix
covMatrix = np.cov(x_train_PCA,rowvar=False)
print(covMatrix)
[[ 1.00169205e+00  6.78118442e-01  7.87371455e-01  6.85806823e-01
   6.92936388e-02  1.00953096e-01  8.12907258e-01 -7.95049523e-01
   8.11587938e-01  6.68780927e-01  7.52926433e-01  8.11512959e-01
   5.80048130e-01 -2.59774752e-01  2.71832010e-01  1.58841508e-01
   3.06211489e-01  3.72922061e-01]
 [ 6.78118442e-01  1.00169205e+00  7.99760286e-01  6.15542826e-01
   1.29678979e-01  2.20158986e-01  8.49211771e-01 -8.24341124e-01
   8.47289748e-01  9.63149729e-01  7.93252045e-01  8.34997729e-01
   9.24076826e-01  4.95824276e-02  1.73691433e-01  3.36053851e-03
  -1.11258955e-01  4.25194075e-02]
 [ 7.87371455e-01  7.99760286e-01  1.00169205e+00  7.56615994e-01
   1.29550290e-01  2.16666148e-01  9.08660706e-01 -9.15452254e-01
   8.98752458e-01  7.83870318e-01  8.58688058e-01  8.89413802e-01
   7.18645934e-01 -2.14505126e-01  1.44092169e-01  2.63425998e-01
   1.28155999e-01  3.16384531e-01]
 [ 6.85806823e-01  6.15542826e-01  7.56615994e-01  1.00169205e+00
   6.58070111e-01  4.38253611e-01  7.27959652e-01 -7.83039183e-01
   7.01465481e-01  5.63834905e-01  7.94602532e-01  7.11598430e-01
   5.38189452e-01 -1.41190503e-01  7.58962418e-02  1.64723873e-01
   3.66262053e-01  4.51476205e-01]
 [ 6.92936388e-02  1.29678979e-01  1.29550290e-01  6.58070111e-01
   1.00169205e+00  6.77275516e-01  8.27504813e-02 -1.58792505e-01
   5.67924954e-02  1.06995780e-01  2.73013570e-01  6.90308465e-02
   1.02270493e-01  2.16939886e-01 -5.67234892e-02 -4.09988198e-02
   2.09024922e-01  2.30856310e-01]
 [ 1.00953096e-01  2.20158986e-01  2.16666148e-01  4.38253611e-01
   6.77275516e-01  1.00169205e+00  1.29357343e-01 -1.44403394e-01
   1.25901754e-01  2.72892208e-01  2.96868626e-01  1.07145300e-01
   1.65843550e-01  3.50701312e-01  3.66809287e-02  2.19695217e-02
  -5.27144083e-02  1.06488359e-01]
 [ 8.12907258e-01  8.49211771e-01  9.08660706e-01  7.27959652e-01
   8.27504813e-02  1.29357343e-01  1.00169205e+00 -9.72950149e-01
   9.92985352e-01  8.11825556e-01  9.44655858e-01  9.93461512e-01
   8.09593828e-01 -2.48623572e-02  9.76538693e-02  2.18391167e-01
  -3.38774770e-03  1.09792962e-01]
 [-7.95049523e-01 -8.24341124e-01 -9.15452254e-01 -7.83039183e-01
  -1.58792505e-01 -1.44403394e-01 -9.72950149e-01  1.00169205e+00
  -9.53323551e-01 -7.80933879e-01 -9.33490788e-01 -9.55303000e-01
  -7.74593828e-01  9.52520620e-02 -7.50195607e-02 -1.92443470e-01
  -1.08041790e-01 -2.08898339e-01]
 [ 8.11587938e-01  8.47289748e-01  8.98752458e-01  7.01465481e-01
   5.67924954e-02  1.25901754e-01  9.92985352e-01 -9.53323551e-01
   1.00169205e+00  8.16104132e-01  9.32218854e-01  9.90592729e-01
   8.09871301e-01 -1.39278013e-02  1.04676137e-01  2.17793401e-01
  -2.76403926e-02  9.12456788e-02]
 [ 6.68780927e-01  9.63149729e-01  7.83870318e-01  5.63834905e-01
   1.06995780e-01  2.72892208e-01  8.11825556e-01 -7.80933879e-01
   8.16104132e-01  1.00169205e+00  7.43538991e-01  7.95551347e-01
   8.71645540e-01  4.53973531e-02  1.75647570e-01  1.34845236e-02
  -1.16056849e-01  6.73921269e-02]
 [ 7.52926433e-01  7.93252045e-01  8.58688058e-01  7.94602532e-01
   2.73013570e-01  2.96868626e-01  9.44655858e-01 -9.33490788e-01
   9.32218854e-01  7.43538991e-01  1.00169205e+00  9.41303980e-01
   7.85710861e-01  1.38933884e-01  5.83383738e-02  1.97887938e-01
  -9.87727841e-05  6.89946542e-02]
 [ 8.11512959e-01  8.34997729e-01  8.89413802e-01  7.11598430e-01
   6.90308465e-02  1.07145300e-01  9.93461512e-01 -9.55303000e-01
   9.90592729e-01  7.95551347e-01  9.41303980e-01  1.00169205e+00
   8.04228754e-01 -1.30637845e-02  9.94995677e-02  2.04970163e-01
  -3.08091413e-03  9.43198346e-02]
 [ 5.80048130e-01  9.24076826e-01  7.18645934e-01  5.38189452e-01
   1.02270493e-01  1.65843550e-01  8.09593828e-01 -7.74593828e-01
   8.09871301e-01  8.71645540e-01  7.85710861e-01  8.04228754e-01
   1.00169205e+00  1.85320920e-01  1.80839523e-01 -4.82969699e-02
  -2.25549318e-01 -1.18522491e-01]
 [-2.59774752e-01  4.95824276e-02 -2.14505126e-01 -1.41190503e-01
   2.16939886e-01  3.50701312e-01 -2.48623572e-02  9.52520620e-02
  -1.39278013e-02  4.53973531e-02  1.38933884e-01 -1.30637845e-02
   1.85320920e-01  1.00169205e+00 -1.14914418e-01 -1.38193881e-01
  -7.32497402e-01 -7.91490503e-01]
 [ 2.71832010e-01  1.73691433e-01  1.44092169e-01  7.58962418e-02
  -5.67234892e-02  3.66809287e-02  9.76538693e-02 -7.50195607e-02
   1.04676137e-01  1.75647570e-01  5.83383738e-02  9.94995677e-02
   1.80839523e-01 -1.14914418e-01  1.00169205e+00 -4.25774461e-02
   1.55485407e-01  1.43528577e-01]
 [ 1.58841508e-01  3.36053851e-03  2.63425998e-01  1.64723873e-01
  -4.09988198e-02  2.19695217e-02  2.18391167e-01 -1.92443470e-01
   2.17793401e-01  1.34845236e-02  1.97887938e-01  2.04970163e-01
  -4.82969699e-02 -1.38193881e-01 -4.25774461e-02  1.00169205e+00
   8.06690591e-02  2.09369278e-01]
 [ 3.06211489e-01 -1.11258955e-01  1.28155999e-01  3.66262053e-01
   2.09024922e-01 -5.27144083e-02 -3.38774770e-03 -1.08041790e-01
  -2.76403926e-02 -1.16056849e-01 -9.87727841e-05 -3.08091413e-03
  -2.25549318e-01 -7.32497402e-01  1.55485407e-01  8.06690591e-02
   1.00169205e+00  8.93351208e-01]
 [ 3.72922061e-01  4.25194075e-02  3.16384531e-01  4.51476205e-01
   2.30856310e-01  1.06488359e-01  1.09792962e-01 -2.08898339e-01
   9.12456788e-02  6.73921269e-02  6.89946542e-02  9.43198346e-02
  -1.18522491e-01 -7.91490503e-01  1.43528577e-01  2.09369278e-01
   8.93351208e-01  1.00169205e+00]]
In [143]:
pca = PCA(n_components=18,random_state=seed)
pca.fit(x_train_PCA)
Out[143]:
PCA(copy=True, iterated_power='auto', n_components=18, random_state=12,
    svd_solver='auto', tol=0.0, whiten=False)
In [144]:
#The eigen Values
print(pca.explained_variance_)
[9.37585040e+00 2.99653283e+00 2.00920122e+00 1.19332515e+00
 8.87310633e-01 5.35453076e-01 3.37403957e-01 2.13432212e-01
 1.64770725e-01 8.68055727e-02 6.92702647e-02 4.65866062e-02
 3.53583409e-02 2.82378283e-02 2.06158337e-02 1.74482714e-02
 9.06385005e-03 3.79008751e-03]
In [145]:
#The eigen Vectors
print(pca.components_)
[[ 2.74536287e-01  2.94026070e-01  3.05394716e-01  2.64843416e-01
   7.15013070e-02  8.37452609e-02  3.18040672e-01 -3.15146187e-01
   3.15424264e-01  2.84009127e-01  3.08846995e-01  3.14514633e-01
   2.75115280e-01 -1.69724573e-02  5.23991387e-02  5.96087639e-02
   2.54951310e-02  6.92308185e-02]
 [-1.47949616e-01  1.17389406e-01 -7.26374270e-02 -1.69672658e-01
  -7.07376612e-02  5.51863947e-02  4.05006048e-02  1.79072077e-02
   5.21497833e-02  1.11774162e-01  6.80833481e-02  4.55752775e-02
   1.99700319e-01  4.97322113e-01 -8.94426942e-02 -1.28622167e-01
  -5.41988253e-01 -5.41133415e-01]
 [-9.93831675e-02 -3.88666867e-02 -4.98304105e-02  3.01754837e-01
   6.47282191e-01  5.87511000e-01 -9.45024372e-02  5.08028442e-02
  -1.07653230e-01 -3.13509739e-02  6.64071458e-02 -1.04952119e-01
  -5.13461202e-02  2.61018924e-01 -8.56481152e-02 -6.42367319e-02
   5.54557506e-02  7.82093874e-02]
 [ 7.90572123e-02  1.82352789e-01 -6.68141222e-02 -5.35017594e-02
   8.70342653e-03  5.03901918e-02 -9.56621532e-02  8.90546934e-02
  -8.97633189e-02  1.91641329e-01 -1.21555589e-01 -9.14412458e-02
   1.92759031e-01 -4.69655673e-02  6.28610576e-01 -6.49085360e-01
   9.05679331e-02  3.90565100e-02]
 [ 5.60877250e-02 -7.16972392e-02  3.81007681e-02 -7.53941730e-02
  -6.74045077e-02  2.21923157e-01 -1.52675641e-02  8.38283189e-02
   2.78566944e-04 -3.44958008e-02 -2.65631551e-03 -2.12829236e-02
  -6.51144726e-02  1.43343277e-01  6.92347966e-01  6.35154267e-01
  -1.15781757e-01 -3.02468387e-02]
 [ 2.11171765e-01 -3.31803769e-01 -1.35897304e-01  2.16825020e-01
   1.68360590e-01 -3.69822223e-01  1.17229052e-01 -1.25375672e-01
   9.85831719e-02 -4.66723823e-01  2.46457186e-01  1.63673020e-01
  -1.59467451e-01  2.63126378e-01  2.32641983e-01 -1.86704248e-01
   1.59282318e-01 -2.47304163e-01]
 [ 4.44675877e-01 -2.22476830e-01  9.89011398e-02 -2.02942491e-01
  -4.06348902e-01  5.35744790e-01  6.01381648e-02  1.41221149e-02
   9.58177958e-02 -5.99781083e-02  7.26789566e-02  7.16122863e-02
  -3.68717670e-01  7.24119066e-02 -1.16373312e-01 -2.55815264e-01
  -2.63525288e-02  4.21729788e-02]
 [-5.24904828e-01 -1.97346754e-01  4.61033943e-01  1.33447913e-01
  -4.06600687e-02  1.35611725e-01  9.17162000e-02 -1.82023465e-01
   6.46133275e-02 -2.66381351e-01  2.89741983e-02  4.68415648e-02
  -1.05184540e-01 -3.57620003e-01  1.66019605e-01 -1.98885022e-01
  -3.27777461e-01 -4.36019877e-02]
 [-5.04474695e-01 -4.57432538e-02 -1.17517659e-01 -2.37177119e-01
  -2.95198702e-01  1.72199142e-01  5.41202414e-02 -1.57013008e-01
   2.91544163e-02 -9.01780587e-02  2.99988372e-01  8.73861241e-02
   2.44325293e-01  2.99908182e-01  1.71166281e-02 -1.53330830e-02
   4.90501527e-01  1.77768194e-01]
 [ 2.72390133e-01 -1.14110684e-01  6.88325757e-02  5.61388942e-02
  -9.94196706e-02  1.62020301e-01 -1.29853693e-01  1.96071198e-01
  -1.23841285e-01 -4.59402531e-01  1.01789498e-01 -1.09848341e-01
   7.13584223e-01 -1.94324599e-01 -9.42557000e-02  4.78434559e-02
  -5.77002122e-02 -5.63033993e-02]
 [ 1.03892522e-01 -6.39356856e-03  7.12020753e-01 -7.61651591e-02
   3.35050233e-04 -2.35125314e-01 -2.21593416e-01 -2.86918955e-02
  -3.11351872e-01  7.12984265e-02  1.41714356e-01 -3.00371399e-01
  -4.43744366e-02  3.74113979e-01 -3.18731513e-02 -1.81083932e-02
   1.07269982e-01  6.37924748e-02]
 [ 5.52939608e-03 -2.58348937e-01  3.59541233e-02 -1.55013797e-01
   1.36487930e-01 -1.68251710e-01  8.33515302e-02  1.79439371e-01
   2.35195451e-01 -4.95600128e-02 -1.92430654e-01  2.05034804e-01
   1.67120486e-01  2.95619343e-01  2.64069590e-03 -7.90172967e-02
  -3.45319322e-01  6.66454501e-01]
 [-1.39321743e-01  9.81549310e-04  2.10329691e-01  7.54650289e-02
   6.38388521e-03 -3.54866890e-02 -4.75979565e-02  8.12639186e-01
   2.06721131e-01  1.18880181e-01  2.06234517e-01  3.01669861e-01
  -8.09158452e-02 -5.92763613e-02 -1.49227244e-02 -2.08007490e-02
   1.92638151e-01 -1.68752735e-01]
 [ 4.43635148e-02 -5.39450360e-01  1.57817749e-01 -3.89811642e-01
   3.08978038e-01  4.62059823e-02  5.11859072e-02 -1.43750933e-01
   1.23039476e-01  2.99343416e-01 -2.68563356e-01  8.33004476e-02
   2.14005368e-01 -1.31409749e-01 -1.59429946e-02  3.35905529e-02
   2.49186237e-01 -3.14523993e-01]
 [-1.98963304e-02 -4.98986316e-01 -2.27943647e-01  2.46263383e-01
  -1.34606099e-01 -7.40051629e-02 -8.42131214e-02  8.73943683e-03
  -1.57844000e-01  4.77700771e-01  4.95194773e-01 -1.81162350e-01
   5.54106799e-02 -1.18952339e-01  2.59329804e-02 -1.09117605e-02
  -2.14670998e-01  1.09560024e-01]
 [-6.40050235e-02 -1.92280067e-01  6.31281903e-02  6.17359026e-01
  -3.72235754e-01  3.97702598e-02  2.45879301e-02  1.16001967e-02
   1.27663817e-01  1.19538321e-01 -5.30611988e-01 -7.87119257e-02
   1.15489723e-01  2.62076121e-01 -2.87220063e-02  9.75541498e-03
   1.58600874e-01 -8.06591479e-02]
 [-8.81422146e-03  4.79912556e-02 -1.16637306e-02 -6.52682241e-02
   4.65463698e-02 -2.78523482e-02 -1.27153975e-01  1.13698125e-02
   7.41098090e-01 -4.95439033e-02  1.22510643e-01 -6.36982164e-01
  -1.78511437e-02 -1.97058827e-02  2.75507019e-03 -1.07298562e-02
   2.54517947e-02 -8.35645396e-03]
 [-1.21827909e-02 -1.26805341e-02  1.47676858e-02 -5.03839723e-02
   3.41147706e-02 -8.79207278e-03  8.65177121e-01  2.37205857e-01
  -1.95733447e-01 -1.24556536e-02 -4.60335142e-03 -3.87804290e-01
   9.18873103e-03  8.03467048e-03 -3.50798496e-03 -1.42387890e-02
   4.06177041e-02  1.02334842e-03]]
In [146]:
#percentage of variation explained by each eigen Vector
print(pca.explained_variance_ratio_)
[5.20000712e-01 1.66192840e-01 1.11433739e-01 6.61838553e-02
 4.92117665e-02 2.96971441e-02 1.87130010e-02 1.18373158e-02
 9.13846643e-03 4.81438565e-03 3.84184745e-03 2.58377292e-03
 1.96103411e-03 1.56611829e-03 1.14338942e-03 9.67711000e-04
 5.02696638e-04 2.10204741e-04]
In [147]:
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
In [148]:
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum Sum of variation explained')
plt.xlabel('eigen Value')
plt.show()
In [149]:
i=1;
while i<19:
    pca = PCA(n_components=i,random_state=seed)
    pca.fit(x_train_PCA)
    print('Principal component of',i,'Captures around ',np.round(pca.explained_variance_ratio_.sum()*100,2),'Percent of Variance in the data')
    i=i+1
Principal component of 1 Captures around  52.0 Percent of Variance in the data
Principal component of 2 Captures around  68.62 Percent of Variance in the data
Principal component of 3 Captures around  79.76 Percent of Variance in the data
Principal component of 4 Captures around  86.38 Percent of Variance in the data
Principal component of 5 Captures around  91.3 Percent of Variance in the data
Principal component of 6 Captures around  94.27 Percent of Variance in the data
Principal component of 7 Captures around  96.14 Percent of Variance in the data
Principal component of 8 Captures around  97.33 Percent of Variance in the data
Principal component of 9 Captures around  98.24 Percent of Variance in the data
Principal component of 10 Captures around  98.72 Percent of Variance in the data
Principal component of 11 Captures around  99.11 Percent of Variance in the data
Principal component of 12 Captures around  99.36 Percent of Variance in the data
Principal component of 13 Captures around  99.56 Percent of Variance in the data
Principal component of 14 Captures around  99.72 Percent of Variance in the data
Principal component of 15 Captures around  99.83 Percent of Variance in the data
Principal component of 16 Captures around  99.93 Percent of Variance in the data
Principal component of 17 Captures around  99.98 Percent of Variance in the data
Principal component of 18 Captures around  100.0 Percent of Variance in the data

Insights:
For Principal component of 7 the model Captures around more than 95% Percent of Variance in the data, so we decide to use the Number of component as 7 and create a new dataframe X_Train_pca7

In [150]:
#n_Components=7 capture about 95% of the variance in the data
pca7 = PCA(n_components=7)
pca7.fit(x_train_PCA)
x_train_pca7 = pca7.transform(x_train_PCA)
In [151]:
pd.DataFrame(x_train_pca7)
Out[151]:
0 1 2 3 4 5 6
0 4.153559 -0.461787 -0.377833 1.269603 1.050148 0.719166 0.532736
1 2.892276 0.677193 0.459446 -0.334097 -0.661037 0.736421 -0.287379
2 -2.931955 -2.884173 -0.352031 2.145320 -0.056691 -0.286230 -0.350695
3 -5.298825 0.783730 -0.654489 2.463252 1.972778 1.299444 -0.089491
4 -1.478930 -2.644776 0.553585 0.327767 -0.860093 -0.404133 0.389771
... ... ... ... ... ... ... ...
587 -0.477857 -1.426177 -0.598000 0.786327 0.482010 -0.203356 -0.591645
588 -1.596782 -2.060841 0.287625 1.346413 -0.245488 0.139889 -0.834658
589 -2.573741 2.935482 -0.941954 -0.667610 0.611014 -0.230041 0.359631
590 4.390161 -0.978755 0.093569 -0.251555 -0.374752 -0.203348 0.111492
591 4.696827 0.022194 0.509139 0.261767 -1.407068 -0.244600 0.605138

592 rows × 7 columns

In [152]:
#n_Components=7 capture about 95% of the variance in the data, Preparing the PCA Test data set for n=7
pca7 = PCA(n_components=7)
pca7.fit(x_test_PCA)
x_test_pca7 = pca7.transform(x_test_PCA)
In [153]:
pd.DataFrame(x_test_pca7)
Out[153]:
0 1 2 3 4 5 6
0 -2.017520 2.175869 1.344064 -0.790433 -0.698341 1.168752 -0.907596
1 7.299035 4.066952 -1.895883 0.578850 0.750181 1.073824 0.432775
2 -3.437859 1.880437 -0.870502 0.963064 0.770652 -1.175937 0.997460
3 -0.912386 0.418286 0.835277 -1.134336 -0.400623 0.703370 -0.966627
4 -1.362233 1.952926 1.497351 0.162725 -0.482447 0.792126 -1.482528
... ... ... ... ... ... ... ...
249 -4.292902 1.221299 -0.191631 0.536359 1.746521 0.580472 -0.719609
250 -2.336706 2.791264 0.118879 0.814263 -0.145917 -0.324569 -0.244693
251 -1.037658 -1.435544 0.523339 1.071520 -0.192588 0.967556 0.648920
252 0.736928 -1.556748 -1.234512 2.362240 0.692789 0.049824 0.162615
253 5.456700 -0.285561 0.043823 0.472919 1.939759 -0.177646 -0.240331

254 rows × 7 columns

7. Repeat steps 3,4 and 5 but this time, use Principal Components instead of the original data. And the accuracy score should be on the same rows of test data that were used earlier. (hint: set the same random state) (20 marks)

7a. Train a Support vector machine using the train set and get the accuracy on the test set on the Principal Component

In [154]:
PCA_x_train=x_train_pca7
PCA_y_train=y_train_PCA
PCA_x_test=x_test_pca7
PCA_y_test=y_test_PCA
In [155]:
#Linear Support vector Machine for Principal component
PCA_lsvm = SVC(kernel='linear',random_state=seed)
PCA_lsvm.fit(PCA_x_train,PCA_y_train)
Out[155]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=12,
    shrinking=True, tol=0.001, verbose=False)
In [156]:
print('Train Data Score :',np.round(PCA_lsvm.score(PCA_x_train, PCA_y_train),4))
print('Test Data Score :',np.round(PCA_lsvm.score(PCA_x_test, PCA_y_test),4))
Train Data Score : 0.8294
Test Data Score : 0.7362
In [157]:
#Predict for PCA train set
pred_train = PCA_lsvm.predict(PCA_x_train)

#Confusion Matrix
PCA_lsvm_cm_train = pd.DataFrame(confusion_matrix(PCA_y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_lsvm_cm_train.index.name = "Predicted"
PCA_lsvm_cm_train.columns.name = "True"
PCA_lsvm_cm_train
Out[157]:
True Bus Car Van
Predicted
Bus 118 33 6
Car 34 250 7
Van 5 16 123
In [158]:
#Predict for PCA test set
pred_test = PCA_lsvm.predict(PCA_x_test)

#Confusion Matrix
PCA_lsvm_cm_test = pd.DataFrame(confusion_matrix(PCA_y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_lsvm_cm_test.index.name = "Predicted"
PCA_lsvm_cm_test.columns.name = "True"
PCA_lsvm_cm_test
Out[158]:
True Bus Car Van
Predicted
Bus 46 16 6
Car 14 100 16
Van 1 14 41
In [159]:
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Linear Support vector machine for test data: \n")
ax=sns.heatmap(PCA_lsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
In [160]:
#summarize the fit of the model
PCA_lsvm_accuracy = np.round( metrics.accuracy_score( PCA_y_test, pred_test ), 4 )
#PCA_lsvm_precision = np.round( metrics.precision_score( PCA_y_test, pred_test ,average=None), 4 )
#PCA_lsvm_recall    = np.round( metrics.recall_score( PCA_y_test, pred_test,average=None ), 4 )
#PCA_lsvm_f1score = np.round( metrics.f1_score( PCA_y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', PCA_lsvm_accuracy)
print('\n') 
print('Metrics Classification Report for Linear Support vector machine regression\n',metrics.classification_report(PCA_y_test, pred_test))
Total Accuracy :  0.7362


Metrics Classification Report for Linear Support vector machine regression
               precision    recall  f1-score   support

         0.0       0.68      0.75      0.71        61
         1.0       0.77      0.77      0.77       130
         2.0       0.73      0.65      0.69        63

    accuracy                           0.74       254
   macro avg       0.73      0.72      0.72       254
weighted avg       0.74      0.74      0.74       254

In [161]:
#Poly Support vector Machine for Principal component
PCA_psvm = SVC(kernel='poly',gamma='auto',random_state=seed)
PCA_psvm.fit(PCA_x_train,PCA_y_train)
Out[161]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
    max_iter=-1, probability=False, random_state=12, shrinking=True, tol=0.001,
    verbose=False)
In [162]:
print('Train Data Score :',np.round(PCA_psvm.score(PCA_x_train, PCA_y_train),4))
print('Test Data Score :',np.round(PCA_psvm.score(PCA_x_test, PCA_y_test),4))
Train Data Score : 0.9003
Test Data Score : 0.748
In [163]:
#Predict for PCA train set
pred_train = PCA_psvm.predict(PCA_x_train)

#Confusion Matrix
PCA_psvm_cm_train = pd.DataFrame(confusion_matrix(PCA_y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_psvm_cm_train.index.name = "Predicted"
PCA_psvm_cm_train.columns.name = "True"
PCA_psvm_cm_train
Out[163]:
True Bus Car Van
Predicted
Bus 129 4 0
Car 28 286 18
Van 0 9 118
In [164]:
#Predict for PCA test set
pred_test = PCA_psvm.predict(PCA_x_test)

#Confusion Matrix
PCA_psvm_cm_test = pd.DataFrame(confusion_matrix(PCA_y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_psvm_cm_test.index.name = "Predicted"
PCA_psvm_cm_test.columns.name = "True"
PCA_psvm_cm_test
Out[164]:
True Bus Car Van
Predicted
Bus 40 6 2
Car 20 109 20
Van 1 15 41
In [165]:
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for Poly Support vector machine for test data: \n")
ax=sns.heatmap(PCA_psvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
In [166]:
#summarize the fit of the model
PCA_psvm_accuracy = np.round( metrics.accuracy_score( PCA_y_test, pred_test ), 4 )
#PCA_psvm_precision = np.round( metrics.precision_score( PCA_y_test, pred_test ,average=None), 4 )
#PCA_psvm_recall    = np.round( metrics.recall_score( PCA_y_test, pred_test,average=None ), 4 )
#PCA_psvm_f1score = np.round( metrics.f1_score( PCA_y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', PCA_psvm_accuracy)
print('\n') 
print('Metrics Classification Report for Poly Support vector machine regression\n',metrics.classification_report(PCA_y_test, pred_test))
Total Accuracy :  0.748


Metrics Classification Report for Poly Support vector machine regression
               precision    recall  f1-score   support

         0.0       0.83      0.66      0.73        61
         1.0       0.73      0.84      0.78       130
         2.0       0.72      0.65      0.68        63

    accuracy                           0.75       254
   macro avg       0.76      0.71      0.73       254
weighted avg       0.75      0.75      0.75       254

In [167]:
#RBF Support vector Machine for Principal component
PCA_rsvm = SVC(kernel='rbf',random_state=seed,gamma='auto')
PCA_rsvm.fit(PCA_x_train,PCA_y_train)
Out[167]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=12, shrinking=True, tol=0.001,
    verbose=False)
In [168]:
print('Train Data Score :',np.round(PCA_rsvm.score(PCA_x_train, PCA_y_train),4))
print('Test Data Score :',np.round(PCA_rsvm.score(PCA_x_test, PCA_y_test),4))
Train Data Score : 0.9662
Test Data Score : 0.8268
In [169]:
#Predict for PCA train set
pred_train = PCA_rsvm.predict(PCA_x_train)

#Confusion Matrix
PCA_rsvm_cm_train = pd.DataFrame(confusion_matrix(PCA_y_train,pred_train).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_rsvm_cm_train.index.name = "Predicted"
PCA_rsvm_cm_train.columns.name = "True"
PCA_rsvm_cm_train
Out[169]:
True Bus Car Van
Predicted
Bus 155 2 0
Car 2 288 7
Van 0 9 129
In [170]:
#Predict for PCA test set
pred_test = PCA_rsvm.predict(PCA_x_test)

#Confusion Matrix
PCA_rsvm_cm_test = pd.DataFrame(confusion_matrix(PCA_y_test,pred_test).T,index=['Bus', 'Car','Van'], columns=['Bus', 'Car','Van'])
PCA_rsvm_cm_test.index.name = "Predicted"
PCA_rsvm_cm_test.columns.name = "True"
PCA_rsvm_cm_test
Out[170]:
True Bus Car Van
Predicted
Bus 55 8 4
Car 4 109 13
Van 2 13 46
In [171]:
plt.figure(figsize = (4,4))
plt.title("Confusion Matrix for RBF Support vector machine for test data: \n")
ax=sns.heatmap(PCA_rsvm_cm_test, annot=True,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()
In [172]:
#summarize the fit of the model
PCA_rsvm_accuracy = np.round( metrics.accuracy_score( PCA_y_test, pred_test ), 4 )
#PCA_rsvm_precision = np.round( metrics.precision_score( PCA_y_test, pred_test ,average=None), 4 )
#PCA_rsvm_recall    = np.round( metrics.recall_score( PCA_y_test, pred_test,average=None ), 4 )
#PCA_rsvm_f1score = np.round( metrics.f1_score( PCA_y_test, pred_test,average=None ), 4 )
print( 'Total Accuracy : ', PCA_rsvm_accuracy)
print('\n') 
print('Metrics Classification Report for RBF Support vector machine regression\n',metrics.classification_report(PCA_y_test, pred_test))
Total Accuracy :  0.8268


Metrics Classification Report for RBF Support vector machine regression
               precision    recall  f1-score   support

         0.0       0.82      0.90      0.86        61
         1.0       0.87      0.84      0.85       130
         2.0       0.75      0.73      0.74        63

    accuracy                           0.83       254
   macro avg       0.81      0.82      0.82       254
weighted avg       0.83      0.83      0.83       254

In [173]:
SVMresult_After_PCA = pd.DataFrame({'Model' : ['SVM Linear', 'SVM Polynomial', 'SVM RBF'], 
                       'Model Accuracy After PCA' : [PCA_lsvm_accuracy, PCA_psvm_accuracy, PCA_rsvm_accuracy],
                      })
SVMresult_After_PCA
Out[173]:
Model Model Accuracy After PCA
0 SVM Linear 0.7362
1 SVM Polynomial 0.7480
2 SVM RBF 0.8268

Insights:
From the above results, Radial basis function SVM Model trained using the Xpca df gives the higher Accuracy when compared to the Linear and the Polynomial SVM Model

7b. Perform K-fold cross validation and get the cross validation score of the model for Principal Component

In [174]:
#K fold Cross validation
PCA_kf=KFold(n_splits= 10, random_state = seed)
PCA_lsvm_results = cross_val_score(estimator = PCA_lsvm, X = PCA_x_train, y = PCA_y_train, cv = PCA_kf)
PCA_lsvm_PCA_kf_accuracy=PCA_lsvm_results.mean()
print(PCA_lsvm_PCA_kf_accuracy)
0.8092372881355934
In [175]:
#K fold Cross validation
PCA_psvm_results = cross_val_score(estimator = PCA_psvm, X = PCA_x_train, y = PCA_y_train, cv = PCA_kf)
PCA_psvm_PCA_kf_accuracy=PCA_psvm_results.mean()
print(PCA_psvm_PCA_kf_accuracy)
0.8597175141242938
In [176]:
#K fold Cross validation
PCA_rsvm_results = cross_val_score(estimator = PCA_rsvm, X = PCA_x_train, y = PCA_y_train, cv = PCA_kf)
PCA_rsvm_PCA_kf_accuracy=PCA_rsvm_results.mean()
print(PCA_rsvm_PCA_kf_accuracy)
0.9138700564971751
In [177]:
Cross_Validation_Score_After_PCA = pd.DataFrame({'Model' : ['SVM Linear KF', 'SVM Polynomial KF', 'SVM RBF KF'], 
                       ' Cross validation Score After PCA' : [PCA_lsvm_PCA_kf_accuracy, PCA_psvm_PCA_kf_accuracy, PCA_rsvm_PCA_kf_accuracy],
                      })
Cross_Validation_Score_After_PCA
Out[177]:
Model Cross validation Score After PCA
0 SVM Linear KF 0.809237
1 SVM Polynomial KF 0.859718
2 SVM RBF KF 0.913870

Insights:
From the above results, Radial basis function SVM Model trained using the Xpca7 Data gives the highest Average Accuracy in the K-Fold Validation when compared to the Linear and the Polynomial SVM Model

8. Compare the accuracy scores and cross validation scores of Support vector machines – one trained using raw data and the other using Principal Components, and mention your findings (5 points)

In [178]:
svm_result=pd.merge(SVMresult_Before_PCA, SVMresult_After_PCA,on='Model')
In [179]:
svm_result
Out[179]:
Model Model Accuracy Before PCA Model Accuracy After PCA
0 SVM Linear 0.9291 0.7362
1 SVM Polynomial 0.8228 0.7480
2 SVM RBF 0.9449 0.8268

From the above results, we can see that the RBF SVM model has higher accuracy when compared to the remaining model. By reducing the dimensionality from 18 to 7, we have dropped around 11% Percent model accuracy.

In [180]:
Cross_Validation_Score_Result=pd.merge(Cross_Validation_Score_Before_PCA, Cross_Validation_Score_After_PCA,on='Model')
In [181]:
Cross_Validation_Score_Result
Out[181]:
Model Cross validation Score Before PCA Cross validation Score After PCA
0 SVM Linear KF 0.959435 0.809237
1 SVM Polynomial KF 0.788785 0.859718
2 SVM RBF KF 0.964463 0.913870

From the above Cross validation score results, we can see that the RBF SVM model has highest average accuracy when compared to the remaining model. By reducing the dimensionality from 18 to 7, we see only 4-5% drop in the average accuracy